Abstract

Due to the growing use of motion capture (mocap) in movies, video games, sports, etc., it is highly desirable
to compress mocap data for efficient storage and transmission. Unfortunately, the existing compression
methods have either high latency or poor compression performance, making them less appealing for
time-critical applications and/or networks with limited bandwidth. This paper presents two efficient methods to
compress mocap data with low latency. The first method processes the data in a frame-by-frame manner
so that it is ideal for mocap data streaming. The second one is clip-oriented and provides a flexible
trade-off between latency and compression performance. It can achieve higher compression performance while
keeping the latency fairly low and controllable. Observing that mocap data exhibits some unique spatial
characteristics, we learn an orthogonal transform to reduce the spatial redundancy. We formulate the
learning problem as the least square of reconstruction error regularized by orthogonality and sparsity, and
solve it via alternating iteration. We also adopt predictive coding and temporal DCT for temporal
decorrelation in the frame- and clip-oriented methods, respectively. Experimental results show that the
proposed methods can produce higher compression performance at lower computational cost and latency
than the state-of-the-art methods. Moreover, our methods are general and applicable to various types of
mocap data.

Keywords: Motion capture, data compression, transform coding, low latency, optimization 


1. Introduction 

As a highly successful technique, motion capture (mocap) has been widely used to animate virtual
characters in distributed virtual reality applications and networked games [1], [?]. Due to the large amount of
data and the limited bandwidth of communication networks, congestion, packet loss, and delay often occur
in mocap data transmission. Therefore, mocap data compression, especially lossy compression, is necessary
to facilitate storage and transmission.

Thanks to its smooth and coherent nature, mocap data exhibits a high degree of temporal and spatial
redundancy, making compression possible. To date, many mocap compression algorithms have been proposed
(see Section 2). Among these approaches, most are sequence-based (e.g., [?]) in that they
process all the frames of a mocap sequence at a time. These methods are able to achieve high compression
performance. However, this performance comes at the price of high latency, i.e., a large
number of frames have to be captured and stored before compression, making them more suitable for efficient
storage. On the other hand, the frame-based approaches (e.g., [10]) aim at time-critical applications (e.g.,
interactive applications) due to their no-latency nature. Unfortunately, the existing frame-based methods
have poor compression performance compared with the sequence-based methods, since they cannot exploit
spatial and temporal correlation well. As neither the sequence-based nor the frame-based methods are perfect, it is
natural to consider the clip-based methods (e.g., [?]), which segment mocap data into short clips,
providing a trade-off between latency and compression performance.

In this paper, we present two efficient methods for compressing mocap data with low latency. The first
method processes the data in a frame-by-frame manner, thereby compressing the data without any inherent
latency. The second one is clip-based and can achieve higher compression performance while keeping
the latency fairly low and controllable. Since mocap data exhibits some unique spatial characteristics, we
propose a learned spatial decorrelation transform (LSDT) to exploit the spatial redundancy. Taking the data
content into account, the LSDT learns an orthogonal matrix via an $\ell_0$-norm regularized optimization. Due to
its data-adapted nature, the proposed LSDT outperforms the commonly used data-independent transforms,
such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT), in terms of compression
performance. We also adopt predictive coding and temporal DCT for temporal decorrelation in the frame-
and clip-based methods, respectively. We observe promising experimental results and demonstrate that
our methods can produce higher compression performance at lower computational cost and latency than
the state of the art.

The rest of this paper is organized as follows: Section 2 reviews previous work on
mocap data compression. Section 3 presents the proposed frame- and clip-based methods. Section 4 describes the
key component of the proposed methods, i.e., the learned spatial decorrelation transform, followed by the
experimental results and discussion in Section 5. Finally, Section 6 concludes this paper.


2. Related Work 

All compression schemes aim to exploit correlations in the data, and mocap data compression is no
exception. In terms of decorrelation techniques, the existing mocap data compression algorithms can be roughly
classified into four groups, which are reviewed and analyzed as follows.

2.1. Principal Component Analysis (PCA) 

As a very popular technique, principal component analysis projects the data onto a few principal orthogonal
bases, converting the data into a smaller set of linearly uncorrelated values.

Breaking the mocap database into short clips that are approximated by Bezier curves, Arikan [11]
performed clustered PCA to reduce their dimensionality. Liu and McMillan [12] projected only the keyframes
onto the PCA bases and interpolated the other frames via spline functions. Motivated by the repeated
characteristics of human motions, Lin et al. [?] projected similar motion clips into PCA space and approximated
them by interpolating functions with range-aware adaptive quantization. Observing that distortion to each
of the joints causes a different overall distortion, Vasa and Brunnett [7] proposed a perception-driven error
metric so that important joints have a higher precision than joints with small impact. They presented
a Lagrange multiplier-based preprocessing for adjusting the joint precision. After Lagrangian equalization,
the entire mocap sequence is projected into PCA pose space. Then, PCA is applied to short clips to further
reduce the temporal redundancy.

Principal geodesic analysis (PGA) is a generalization of PCA for handling the case where the data is
sampled from curved manifolds. Tournier et al. [?] presented a PGA-based method for the pose manifold in
the configuration space of a skeleton, leading to a reduced, data-driven pose parameterization. Compression
is then obtained by storing only the approximate parameterization along with the end-joint and root-joint
trajectories.

Although PCA can decorrelate mocap data very well, its bases are data-dependent and usually difficult to
compress. Therefore, one has to explicitly store the orthogonal bases, which reduces the overall compression
performance. Furthermore, PCA is usually applied to the whole mocap sequence (e.g., [?]), resulting in
high latency.


2.2. Discrete Wavelet and Cosine Transforms 

DCT and DWT are commonly used techniques for converting correlated data into the frequency domain, in
which energy mainly concentrates on a few frequencies (i.e., most transform coefficients are close to zero). DCT
and DWT have been widely adopted in video/image coding standards [?], [?]. Moreover, they
have also been exploited in the compression of 3D geometric data, e.g., static/dynamic meshes [17], [18], [19] and
mocap data [?].

Kwak and Bajic [10] applied 1D DCT to the predictive residuals between consecutive frames to exploit
the spatial coherence. In contrast, Preda et al. [?] applied 1D DCT/DWT to the residuals of motion
compensation along the temporal dimension. Beaudoin et al. [?] and Firouzmanesh et al. [22] applied
1D DWT to trajectories of degrees of freedom and selected the sparse wavelet coefficients by a
perceptual-based metric. Observing that neither 1D DCT nor 1D DWT considers the spatial and temporal correlation
simultaneously, Chew et al. [13] used Fuzzy C-means clustering to represent the mocap clips as 2D arrays,
on which 2D DWT was applied.

As pointed out in [?], mocap data have some unique features that distinguish them from natural
videos/images. For example, applying 1D DCT/DWT to each trajectory produces sparsity in the transform
domain, since each trajectory is a smooth spatial curve. However, it does not make sense to apply 1D
DCT/DWT to each mocap frame due to the lack of smoothness in the frame (see the analysis in Section 4).

2.3. Mocap Data Favored Transforms 

As general-purpose transforms, DWT and DCT are data-independent so that one does not need to
store the bases. In contrast, data-driven transforms are adaptive to the input data and can thus take
advantage of its intrinsic structure. However, the adaptiveness comes at the price of storing the basis
functions explicitly.

Zhu et al. [23] proposed an elegant sparse decomposition model for the quaternion space that decomposes
human rotational motion into a dictionary part and a weight part. As a result, a linear combination
of 3D motion is equivalent to quaternion multiplication and the weight of linear combination is a power
operation on quaternion. They showed that the transformed weights are sparse, leading to good compression
performance. However, the quaternion space sparse representation is computationally expensive, limiting
its application to long motion sequences. Hou et al. represented a motion sequence as a third-order tensor,
which exhibits strong correlation within and across its slices. They performed the canonical polyadic (CP)
tensor decomposition to explore correlation within and among clips to realize dimensionality reduction.
Recently, Hou et al. [?] proposed the mocap data tailored transform (MDTT), which partitions the input
motion into clips and then computes a set of data-dependent orthogonal bases by minimizing the least square
of distortions. Computational results show that MDTT significantly outperforms the existing techniques
(e.g., [?], [13], [?]) in terms of both compression performance and runtime. However, due to the overhead
of explicitly storing the orthogonal bases, MDTT is less appealing for short motion sequences. Note that
all of the above-mentioned methods [23], [?], [?] have very high latency due to their sequence-based nature.


2.4. Indexing-based Methods

Chattopadhyay et al. [?] proposed a smart indexing algorithm for exploiting structural information
derived from the human skeleton, where each floating point number is represented as an integer index, based
on the statistical distribution of the floating point numbers in a motion matrix. Gu et al. [?] organized the
markers into a hierarchy where each node corresponds to a meaningful part of the human body and coded
each body part separately. Then, the motion sequence is represented as a series of motion pattern indices
with respect to a predefined dataset including various patterns.

(a) Frame-based method (b) Clip-based method

Figure 1: The flowcharts of the proposed frame- and clip-based methods.


3. Overview 

Given a mocap sequence of F frames, we denote its i-th frame by $\mathbf{m}_i^d = [d_1 \cdots d_J]^\top \in \mathbb{R}^{J}$, where J
is the number of key points (markers) and $d \in \{x, y, z\}$ stands for the d-dimensional coordinate. Then the
d-component of the motion sequence is represented by a J-by-F matrix $\mathbf{M}^d = [\mathbf{m}_1^d \cdots \mathbf{m}_F^d] \in \mathbb{R}^{J \times F}$.
Each row of $\mathbf{M}^d$ corresponds to the d-trajectory of a key point. We partition $\mathbf{M}^d$ into non-overlapping clips
of equal length, each of size $J \times L$, where L is the clip length.

The primary goal of data compression is to reduce redundancy or correlation in the data. As pointed
out in [8], a typical mocap sequence exhibits strong spatial correlation due to the highly coordinated and
structured nature of key points, and strong local temporal correlation since the object moves smoothly at
a relatively small time scale. Therefore, mocap compression aims at eliminating both types of correlation
as much as possible. In the following subsections, we present two low-latency and high-efficiency methods for
compressing mocap data.


3.1. Frame-based Method 

As shown in Figure 1(a), the frame-based method processes one frame at a time so that there is no
inherent latency. Let us denote by $\mathbf{B}^d \in \mathbb{R}^{J \times J}$ the basis functions of the learned spatial decorrelation transform
(LSDT) (to be presented in Section 4). For the first frame $\mathbf{m}_1^d$, we use $\mathbf{B}^d$ to remove its spatial correlation,
i.e.,

$$\mathbf{c}_1^d = \mathbf{B}^d \mathbf{m}_1^d. \tag{1}$$

Then we apply simple predictive coding to the following frames to eliminate the temporal redundancy:
the i-th frame is predicted only from the previous reconstructed one,

$$\mathbf{r}_i^d = \mathbf{m}_i^d - \hat{\mathbf{m}}_{i-1}^d, \quad (i \ge 2) \tag{2}$$

where $\hat{\mathbf{m}}_{i-1}^d$ is the reconstructed (i-1)-th frame, which is obtained by inverse quantization and inverse
LSDT. Then, applying the spatial decorrelation transform $\mathbf{B}^d$ to the residual vector $\mathbf{r}_i^d$, we obtain

$$\mathbf{c}_i^d = \mathbf{B}^d \mathbf{r}_i^d, \tag{3}$$

where $\mathbf{c}_i^d \in \mathbb{R}^{J}$ are the transformed coefficients.

Finally, we perform the hard thresholding operation and uniform quantization on $\mathbf{c}_i^d$. We store the
following information for reconstruction: (1) the locations and values of nonzero elements, which are
further entropy-coded using lossless coding, i.e., Huffman codes; (2) the number of nonzero elements in each
coefficient vector, which is encoded using fixed-length encoding.
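
The closed prediction loop of Eqs. (1)-(3) can be illustrated with a minimal NumPy sketch. It is not the authors' MATLAB implementation: the function name `encode_decode_frames` and the parameters `step` (quantization step) and `thresh` (hard threshold) are hypothetical, and the entropy coding stage is omitted. The first frame is handled by predicting from a zero vector, which reduces Eq. (3) to Eq. (1).

```python
import numpy as np

def encode_decode_frames(M, B, step, thresh):
    """Closed-loop predictive coding of one coordinate channel.

    M: (J, F) array with one frame per column; B: (J, J) orthogonal transform.
    step and thresh are hypothetical quantizer and threshold settings.
    Returns the integer symbols that would be entropy-coded and the
    frames that the decoder would reconstruct.
    """
    J, F = M.shape
    symbols = np.zeros((J, F), dtype=np.int64)
    recon = np.zeros((J, F))
    prev = np.zeros(J)                    # zero prediction for the first frame
    for i in range(F):
        r = M[:, i] - prev                # Eq. (2): residual w.r.t. previous reconstruction
        c = B @ r                         # Eq. (3): spatial decorrelation
        c[np.abs(c) < thresh] = 0.0       # hard thresholding
        q = np.rint(c / step).astype(np.int64)   # uniform quantization
        symbols[:, i] = q
        # Decoder side: inverse quantization and inverse transform (B^-1 = B^T).
        prev = prev + B.T @ (q * step)
        recon[:, i] = prev
    return symbols, recon
```

Because the encoder predicts from the reconstructed frame rather than the original one, the quantization error of each frame does not accumulate over time in this sketch.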

3.2. Clip-based Method 

The frame-based scheme has no inherent latency at the price of relatively low compression performance,
since it cannot fully exploit the temporal coherence. The clip-based scheme, in contrast, processes L
consecutive frames at a time, leading to better temporal decorrelation. With a proper L, the clip-based algorithm
offers a trade-off between latency and compression performance.

Figure 1(b) shows the flowchart of the clip-based scheme. Let $\mathbf{M}^d \in \mathbb{R}^{J \times L}$ be a clip of length L. Each
row of $\mathbf{M}^d$ corresponds to the d-dimensional trajectory of a key point, i.e., a spatial curve. Thus, applying
the 1D DCT to the rows of $\mathbf{M}^d$ to exploit the temporal correlation (see the analysis in Section 4), we obtain

$$\tilde{\mathbf{M}}^d = \mathbf{M}^d \mathbf{U}_t, \tag{4}$$

where $\mathbf{U}_t \in \mathbb{R}^{L \times L}$ is the 1D DCT matrix. We then apply the LSDT to $\tilde{\mathbf{M}}^d$ to further remove its spatial
redundancy,

$$\mathbf{C}^d = \mathbf{B}^d \tilde{\mathbf{M}}^d. \tag{5}$$

Finally, we adopt the same quantization and entropy coding used in the frame-based method to encode
$\mathbf{C}^d$ into a bit stream. The sequence can be reconstructed by inverse quantization and inverse transform.
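
As an illustration of the two-stage transform in Eqs. (4) and (5), the sketch below builds an orthonormal DCT-II matrix and applies the temporal and spatial transforms to one clip. The helper names `dct_matrix`, `clip_forward`, and `clip_inverse` are hypothetical, and the choice of the orthonormal DCT-II variant is an assumption; quantization and entropy coding are left out.

```python
import numpy as np

def dct_matrix(L):
    """Orthonormal DCT-II basis stored as the columns of an L-by-L matrix."""
    n = np.arange(L)[:, None]           # time index (rows)
    k = np.arange(L)[None, :]           # frequency index (columns)
    U = np.sqrt(2.0 / L) * np.cos(np.pi * (2 * n + 1) * k / (2 * L))
    U[:, 0] /= np.sqrt(2.0)             # rescale the DC column so that U^T U = I
    return U

def clip_forward(M_clip, B, U_t):
    """Eqs. (4)-(5): temporal DCT on the rows, then the spatial transform."""
    return B @ (M_clip @ U_t)

def clip_inverse(C, B, U_t):
    """Inverse of clip_forward, using B^-1 = B^T and U_t^-1 = U_t^T."""
    return B.T @ C @ U_t.T
```

Since both matrices are orthogonal, the inverse needs only transposes, so without quantization a clip is recovered exactly up to floating-point error.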


4. Learned Spatial Decorrelation Transform 


DCT and DWT decorrelate the data by converting it from the spatial domain to the frequency domain in a sparse
form. They have been widely used for image and video compression [?], [23]. DCT is suitable for signals that
can be approximately modeled as a first-order Markov process (Markov-I) with a correlation coefficient close to 1,
while DWT is particularly suited to piecewise smooth signals [23]. Note that each row of $\mathbf{M}^d$ corresponds to the
d-dimensional trajectory of a key point, which can be viewed as Markov-I. Thus, it is reasonable to employ
DCT to exploit the coherence within them. However, since the key points are organized in an irregular,
tree-like structure (i.e., skeleton graph), the elements of $\mathbf{m}_i^d$ may not be correlated with their neighbors,
meaning that the columns of $\mathbf{M}^d$ do not follow Markov-I. Also note that the columns of $\mathbf{M}^d$ do not exhibit
the piecewise smooth characteristic either. As a result, it does not make sense to apply DCT or DWT for
decorrelation among the rows of $\mathbf{M}^d$. We refer readers to [?] for quantitative analysis. As pointed out in
[?], [?], mocap data lies in a relatively low-dimensional space, which is spanned by a set of specific
bases. Based on the above analysis, we propose to learn an orthogonal transform to span the subspace of
mocap data as much as possible.

Given N training frames $\{\mathbf{m}_i^d\}_{i=1}^{N}$, the learned spatial decorrelation transform (LSDT) aims
at finding an orthogonal matrix $\mathbf{B}^d \in \mathbb{R}^{J \times J}$ so that it can transform each training frame into a sparse vector.
We formulate the learning problem as follows:

$$\min_{\mathbf{B}^d, \{\mathbf{e}_i^d\}} \sum_{i=1}^{N} \left\| \mathbf{B}^d \mathbf{m}_i^d - \mathbf{e}_i^d \right\|_2^2 \quad \text{subject to} \quad \mathbf{B}^d \mathbf{B}^{d\top} = \mathbf{B}^{d\top} \mathbf{B}^d = \mathbf{I}, \quad \|\mathbf{e}_i^d\|_0 \le P, \tag{6}$$

where the $\ell_0$-norm $\|\mathbf{e}_i^d\|_0$ counts the number of non-zero entries in the transform coefficient of the i-th
training sample, P is the user-specified parameter controlling the sparsity in $\mathbf{e}_i^d$, and $\mathbf{I} \in \mathbb{R}^{J \times J}$ is the
identity matrix. The orthogonality constraint on $\mathbf{B}^d$ allows us to obtain the inverse LSDT easily. Observe
that the optimization problem in Equation (6) is non-convex due to the non-convex constraints. We develop
an alternating iterative method, which alternately solves the following two subproblems until convergence.

4.1. The Sparse Vector Subproblem

With fixed $\mathbf{B}^d$, let $\mathbf{g}_i^d = \mathbf{B}^d \mathbf{m}_i^d$. The sparse vector subproblem is equivalent to the summation of multiple
independent minimization problems (one per training frame), in which the i-th one is written as

$$\min_{\mathbf{e}_i^d} \left\| \mathbf{g}_i^d - \mathbf{e}_i^d \right\|_2^2 \quad \text{subject to} \quad \|\mathbf{e}_i^d\|_0 \le P. \tag{7}$$

The minimum is achieved only when $\mathbf{e}_i^d$ contains the largest P entries (in magnitude) of $\mathbf{g}_i^d$
at their corresponding locations. Therefore, we can compute $\mathbf{e}_i^d$ by setting the (J - P) smallest (in
magnitude) entries of $\mathbf{B}^d \mathbf{m}_i^d$ to zero:

$$\mathbf{e}_i^d = \mathcal{T}(\mathbf{g}_i^d, J - P), \tag{8}$$

where $\mathcal{T}$ is the truncating operation.
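
The truncating operation amounts to keeping the P largest-magnitude entries of a vector. A minimal sketch follows; the function name `truncate` and its signature (which takes P rather than J - P) are illustrative choices, not taken from the paper.

```python
import numpy as np

def truncate(g, P):
    """Keep the P largest-magnitude entries of g and zero the other J - P."""
    g = np.asarray(g, dtype=float)
    e = np.zeros_like(g)
    if P > 0:
        keep = np.argsort(np.abs(g))[-P:]   # indices of the P largest |g_j|
        e[keep] = g[keep]
    return e
```

For example, `truncate([3.0, -1.0, 0.5, -4.0], 2)` keeps the entries 3.0 and -4.0 and zeroes the other two.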


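4.2. The Orthogonal Matrix Subproblem

Given fixed sparse vectors $\mathbf{e}_i^d$, $i = 1, \ldots, N$, let us denote by $\mathbf{E}^d = [\mathbf{e}_1^d, \ldots, \mathbf{e}_N^d]$ the matrix representation.
The orthogonal matrix subproblem is

$$\min_{\mathbf{B}^d} \left\| \mathbf{B}^d \mathbf{M}^d - \mathbf{E}^d \right\|_F^2 \quad \text{subject to} \quad \mathbf{B}^d \mathbf{B}^{d\top} = \mathbf{B}^{d\top} \mathbf{B}^d = \mathbf{I}, \tag{9}$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix and $\mathbf{M}^d = [\mathbf{m}_1^d, \ldots, \mathbf{m}_N^d]$ is the matrix representation of all training frames.
Observe that

$$\begin{aligned} \left\| \mathbf{B}^d \mathbf{M}^d - \mathbf{E}^d \right\|_F^2 &= \mathrm{Tr}\left( (\mathbf{B}^d \mathbf{M}^d - \mathbf{E}^d)(\mathbf{B}^d \mathbf{M}^d - \mathbf{E}^d)^\top \right) \\ &= \mathrm{Tr}\left( \mathbf{M}^d \mathbf{M}^{d\top} \right) - 2\,\mathrm{Tr}\left( \mathbf{B}^d \mathbf{M}^d \mathbf{E}^{d\top} \right) + \mathrm{Tr}\left( \mathbf{E}^d \mathbf{E}^{d\top} \right), \end{aligned}$$

where Tr is the matrix trace.

Ignoring the first and third terms, which are constant, the minimization problem in (9) is equivalent to

$$\max_{\mathbf{B}^d} \mathrm{Tr}\left( \mathbf{B}^d \mathbf{M}^d \mathbf{E}^{d\top} \right) \quad \text{subject to} \quad \mathbf{B}^d \mathbf{B}^{d\top} = \mathbf{B}^{d\top} \mathbf{B}^d = \mathbf{I}. \tag{10}$$

Factoring $\mathbf{M}^d \mathbf{E}^{d\top}$ using the singular value decomposition (SVD), we obtain $\mathbf{M}^d \mathbf{E}^{d\top} = \mathbf{U}^d \boldsymbol{\Sigma}^d \mathbf{V}^{d\top}$, where
$\mathbf{U}^d, \mathbf{V}^d \in \mathbb{R}^{J \times J}$ are orthogonal matrices, and $\boldsymbol{\Sigma}^d$ is a diagonal matrix.

Then we can rewrite the objective function as

$$\mathrm{Tr}\left( \mathbf{B}^d \mathbf{M}^d \mathbf{E}^{d\top} \right) = \mathrm{Tr}\left( \mathbf{B}^d \mathbf{U}^d \boldsymbol{\Sigma}^d \mathbf{V}^{d\top} \right) = \mathrm{Tr}\left( \tilde{\mathbf{B}}^d \boldsymbol{\Sigma}^d \right),$$

where $\tilde{\mathbf{B}}^d = \mathbf{V}^{d\top} \mathbf{B}^d \mathbf{U}^d$ is still an orthogonal matrix.

Since $\boldsymbol{\Sigma}^d$ is a diagonal matrix with non-negative entries, maximizing (10) is equivalent to maximizing the diagonal entries of
$\tilde{\mathbf{B}}^d$. With the Cauchy-Schwarz inequality, the i-th diagonal entry of $\tilde{\mathbf{B}}^d$ satisfies

$$\tilde{b}_{ii} = \sum_{l=1}^{J} v_{li} w_{li} \le \sqrt{\sum_{l=1}^{J} v_{li}^2} \sqrt{\sum_{l=1}^{J} w_{li}^2} = 1,$$

where $v_{li}$ and $w_{li}$ are the (l, i)-th entries of $\mathbf{V}^d$ and $\mathbf{B}^d \mathbf{U}^d$, respectively.
The last equality comes from the fact that $\mathbf{V}^d$ and $\mathbf{B}^d \mathbf{U}^d$ are orthogonal matrices, so their columns have unit norm. Therefore, the objective
function in (10) is maximized when $\tilde{\mathbf{B}}^d = \mathbf{I}$, leading to

$$\mathbf{B}^d = \mathbf{V}^d \mathbf{U}^{d\top}. \tag{11}$$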
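
The closed-form solution of the orthogonal matrix subproblem is the classical orthogonal Procrustes update, B = V Uᵀ with U and V taken from the SVD of M Eᵀ. The following sketch shows this single step; the function name `procrustes_update` is hypothetical.

```python
import numpy as np

def procrustes_update(M, E):
    """Return the orthogonal B minimizing ||B M - E||_F, i.e. B = V U^T.

    M: (J, N) training frames, E: (J, N) fixed sparse targets.
    """
    # np.linalg.svd returns U, singular values, and V^T with M E^T = U diag(s) V^T.
    U, _, Vt = np.linalg.svd(M @ E.T)
    return Vt.T @ U.T
```

A quick sanity check: if E is generated as Q M for some orthogonal Q and M has full row rank, the update recovers Q.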

Algorithm 1 shows the pseudocode of the LSDT algorithm. In each iteration, the truncating operation
(line 4) takes $O(J \log J)$ time per training frame, and the matrix multiplications (lines 6 and 7) take $O(2NJ^2)$ time. Singular
value decomposition has an $O(J^3)$ time complexity. Putting it all together, the time complexity of Algorithm
1 is $O(KNJ^2 + KNJ \log J + KJ^3)$. Although there is no theoretical guarantee of the convergence of our
algorithm, each subproblem does have an exact solution and we observe that it converges in a few hundred
iterations on training datasets (see Section 5.1).

Algorithm 1 Computing LSDT Bases for Mocap Data

Input: training samples $\{\mathbf{m}_i^d\}_{i=1}^{N}$, the sparsity parameter P and the maximum number of iterations K

Output: the orthogonal matrix $\mathbf{B}^d$

1: initialize $\mathbf{B}^d$ using an orthogonal matrix (e.g., DCT or DWT bases)
2: for iter ← 1 : K do
3:     for i ← 1 : N do
4:         update $\mathbf{e}_i^d$ using (8)
5:     end for
6:     factor $\mathbf{M}^d \mathbf{E}^{d\top}$ using SVD
7:     update $\mathbf{B}^d$ using (11)
8: end for
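
For readers who prefer executable code, the following NumPy sketch mirrors the alternating steps of Algorithm 1 for a single coordinate channel. It is an illustration under stated assumptions rather than the authors' MATLAB code: the function name `learn_lsdt`, the optional argument `B0`, the identity default initialization, and the returned objective history are all choices made for this sketch, and it assumes 1 <= P <= J.

```python
import numpy as np

def learn_lsdt(M, P, K, B0=None):
    """Alternating minimization sketch of Algorithm 1.

    M: (J, N) training frames as columns; P: sparsity level; K: iterations.
    B0: optional orthogonal initial guess. The identity default is an
    assumption of this sketch; the paper initializes with DCT or DWT bases.
    """
    J, N = M.shape
    B = np.eye(J) if B0 is None else B0.copy()
    cols = np.arange(N)
    history = []
    for _ in range(K):
        G = B @ M
        # Sparse vector step, Eq. (8): keep the P largest |entries| per column.
        E = np.zeros_like(G)
        keep = np.argsort(np.abs(G), axis=0)[J - P:, :]
        E[keep, cols] = G[keep, cols]
        # Orthogonal matrix step, Eq. (11): B = V U^T from the SVD of M E^T.
        U, _, Vt = np.linalg.svd(M @ E.T)
        B = Vt.T @ U.T
        history.append(float(np.sum((B @ M - E) ** 2)))
    return B, history
```

The returned `history` holds the value of the objective in (9) after each orthogonal update, which is convenient for producing convergence plots.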


Table 1: Description of training sequences and test sequences.

Sequence   F        Size (kB)   Description
86_02      10,617   3,856.9     walk, squats, run, stretch, jumps, punches, and drinking
56_04      6,767    2,458.3     fists up, wipe window, grab, walk, throw punches, yawn, stretch, jump
15_05      22,948   8,336.5     wash windows, paint, hand signals, dance, dive, twist, boxing
14_08      2,625    953.6       jump up to grab
15_04      22,549   8,191.6     dance, the twist, boxing
17_08      6,179    2,244.7     muscular person's walk
17_10      2,783    1,011       boxing
41_07      7,536    2,737.7     climb, step over, jump over
49_02      2,085    757.4       jump, hop on one foot
56_07      9,420    3,422.1     yawn, stretch, walk, run, halt
85_12      4,499    1,634.4     jumps, flips, breakdance
86_05      8,340    3,029.7     walking, jumping, punching


5. Experimental Results and Discussion 

We implement our methods in MATLAB in only 200 lines of code and evaluate them on the CMU
Mocap Database¹, in which each frame consists of J = 31 key points (i.e., joints of the human skeleton)
sampled at 120 frames per second (fps). We store each coordinate of the original data as a 32-bit float and
thereby represent one key point using 96 bits. Table 1 describes the training and test motion sequences and
their lengths.
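
The raw sizes in Table 1 appear to follow directly from this storage format: F frames times 31 key points times 3 coordinates times 4 bytes. The short sketch below reproduces the arithmetic, assuming that "kB" in the table means 1024 bytes (an assumption that happens to match the listed values); the function name `raw_size_kib` is hypothetical.

```python
def raw_size_kib(num_frames, num_joints=31, bytes_per_coord=4):
    """Uncompressed size in units of 1024 bytes.

    Each key point stores 3 coordinates of bytes_per_coord bytes each
    (96 bits per key point for 32-bit floats).
    """
    return num_frames * num_joints * 3 * bytes_per_coord / 1024.0

# Sequence 86_02 has 10,617 frames: 10,617 * 31 * 12 = 3,949,524 bytes.
size_86_02 = raw_size_kib(10617)
```

Under this assumption, `size_86_02` is about 3856.96, in line with the 3,856.9 kB reported for sequence 86_02.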

The compression distortion D is measured by the average Euclidean distance between the original joint
location $\mathbf{p}_{i,j} := (x_{i,j}, y_{i,j}, z_{i,j})$ and the reconstructed location $\hat{\mathbf{p}}_{i,j} := (\hat{x}_{i,j}, \hat{y}_{i,j}, \hat{z}_{i,j})$ (in cm),

$$D = \frac{1}{FJ} \sum_{i=1}^{F} \sum_{j=1}^{J} \left\| \mathbf{p}_{i,j} - \hat{\mathbf{p}}_{i,j} \right\|_2. \tag{12}$$

The compression ratio (CR) is the ratio between the original data size and the compressed data size. The
compression performance is governed by the number of quantization bits: more quantization bits yield smaller distortion
at a lower CR.
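
Both evaluation measures are simple to compute. The following sketch implements the average Euclidean distance of Eq. (12) and the compression ratio with the standard library only; the function names and the nested-sequence data layout are assumptions made for illustration.

```python
import math

def average_distortion(original, reconstructed):
    """Eq. (12): mean Euclidean distance over all frames and joints.

    Both arguments are sequences of frames; each frame is a sequence of
    (x, y, z) joint positions in the same unit (cm in the paper).
    """
    total, count = 0.0, 0
    for frame_o, frame_r in zip(original, reconstructed):
        for p, p_hat in zip(frame_o, frame_r):
            total += math.dist(p, p_hat)   # Euclidean distance of one joint
            count += 1
    return total / count

def compression_ratio(original_bytes, compressed_bytes):
    """Original data size divided by compressed data size."""
    return original_bytes / compressed_bytes
```

For instance, a single frame with two joints, one displaced by (3, 4, 0) cm and one reconstructed exactly, gives a distortion of 2.5 cm.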

5.1. Training the LSDT Bases 

We take sequences “86_02”, “56_04”, and “15_05” as the training datasets, which consist of various types
of human motion. It is worth noting that more training frames can yield better performance, but the
computational cost also increases. Thus, there is a trade-off between quality and efficiency.


¹ http://mocap.cs.cmu.edu/

Figure 2: Visualization of the 1D DCT and LSDT bases, where the greyscale color indicates the normalized function value. In
each square matrix, a column corresponds to one basis function and frequencies increase from left to right.

(a) $\mathbf{B}^x$ (b) $\mathbf{B}^y$ (c) $\mathbf{B}^z$

Figure 3: Convergence plots of Algorithm 1 with two different initializations. Number of training frames N = 10,617; sparsity parameter
P = 8. (a), (b), and (c) correspond to the x-, y-, and z-coordinates, respectively.


The LSDT bases training algorithm (cf. Algorithm 1) is an iterative algorithm. We evaluate the
convergence rate of the training algorithm on two types of initializations, 1D DCT bases and 1D DWT bases
realized by the 3-level “Haar” wavelet. As Figure 3 shows, the objective function converges to almost the
same value after a few hundred iterations, suggesting that the output of Algorithm 1 is largely insensitive
to the initialization. Figure 2 visualizes the bases of 1D DCT and LSDT to show the difference
between them.

The parameter P, specifying the sparsity of transform coefficients during the learning procedure,
directly affects the structure of the learned orthogonal matrix $\mathbf{B}^d$, which in turn controls the compression
performance. In the training process, we set P to four different values: 2, 5, 8, and 11. Then, the learned
orthogonal matrices under different P are tested in the frame- and clip-based methods, respectively. For both
schemes, four randomly chosen sequences with various motion characteristics and lengths are compressed,
and the results are shown in Figure 4, where we can see that the best compression performance is achieved
when P is equal to 8.

5.2. Evaluating the Spatial Decorrelation Transforms 

We compare the performance of several spatial decorrelation transforms (SDTs), including LSDT, spatial DCT,
and spatial DWT. We apply each transform to the x, y and z components of each frame separately, and
examine the relationship between the percentage of nonzero transform coefficients and the distortion. As
Figure 5 shows, given the same number of nonzero transformed coefficients, the distortions produced by
LSDT are consistently much smaller than those of DCT and DWT, meaning that LSDT concentrates energy
(i.e., spatially decorrelates mocap data) better than DCT and DWT.

(a) 14_08 (b) 17_08 (c) 41_07 (d) 49_02

Figure 4: The impact of the sparsity parameter P on the overall compression performance. The top and bottom rows correspond
to the frame- and clip-based (L = 240) schemes, respectively. $\mathbf{B}^d$ is initialized using the DCT bases.

(a) 14_08 (b) 17_08 (c) 41_07 (d) 56_07

Figure 5: Evaluating the performance of spatial decorrelation of the proposed LSDT. The horizontal axis shows the percentage
of nonzero transformed coefficients (%). LSDT performs the best among the three SDTs.


5.3. Compression Performance 

Figure 6 shows the CR-distortion (CR-D) curves of the frame-based scheme. As Section 5.2 shows, our
data-adapted LSDT is superior to the data-independent 1D DCT for spatial decorrelation. Therefore, it
is not surprising that our frame-based scheme significantly outperforms the 1D DCT-based method [10] in
terms of compression performance. We observe that, at a relatively high CR, our frame-based scheme can
reduce the distortion of [10] by up to 70%.

Figure 7 shows the CR-D curves of the clip-based scheme, from which we observe the following:

1. As expected, the clip-based scheme has much better compression performance than the frame-based
scheme, since it can exploit the temporal coherence better. At the same time, users can easily control
the latency for the clip-based scheme. Taking the CMU mocap data, which are sampled at 120 fps, as
an example, the clip length L = 120 (resp. 240) means a latency of 1 second (resp. 2 seconds).

2. The compression performance of the proposed clip-based scheme can be improved by increasing the
clip length (or latency). More specifically, when L ranges from 60 to 120, the trajectories in a clip still
remain smooth and have small variation (due to the short duration), causing the DCT coefficients to
be distributed at similar locations, which can then be encoded using a similar number of bits. Since
the number of clips in the sequence decreases, the total number of bits to encode one sequence (i.e.,
the sum of bits for all clips in one sequence) is significantly reduced, leading to higher compression
performance. However, the improvement is small when L increases from 120 to 240. The
reason is that the joint trajectories change more significantly. As a result, the DCT coefficients are
spread out, requiring more bits for encoding. Although the number of clips decreases, the total number
of bits for one sequence only drops slightly.

(a) 15_04 (b) 17_08 (c) 17_10 (d) 41_07 (e) 49_02 (f) 56_07 (g) 85_12 (h) 86_05

Figure 6: Comparison of compression performance of frame-based methods.
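
The latency values quoted for the clip-based scheme follow from dividing the clip length by the capture frame rate. A one-line sketch of this arithmetic (the function name `clip_latency_seconds` is hypothetical, and only the buffering delay of one clip is counted, not the processing time):

```python
def clip_latency_seconds(clip_length, fps=120):
    """Buffering latency of one clip: L frames captured at fps frames per second."""
    return clip_length / fps
```

With the CMU data at 120 fps, L = 60, 120, and 240 correspond to 0.5, 1, and 2 seconds, respectively.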


5.4. Comparison

Table 2 qualitatively compares our methods with the existing works in terms of latency, computational
cost, implementation, compression performance, and the number of parameters used in the encoding
process. Note that all methods have a quantization parameter to specify the number of bits used to quantize a
coefficient. We do not include this quantization parameter in Table 2 since it is a fixed parameter determined
by the bandwidth. Also note that the sparsity parameter P in our method appears only in the training stage.

In this subsection, we compare our clip-based scheme with only two works, namely the PCA-Rate Distortion
Optimization (PCA-RDO) method [7], and the equal segmentation case of the Mocap Data Tailored Transform
(MDTT) method [?], which represent the state of the art. See [7] and [?] for detailed performance evaluation
on earlier works [?].


5.4.1. Comparison with the MDTT Method

Both our clip-based algorithm and the MDTT method [?] apply temporal DCT to each trajectory for
temporal decorrelation. The two methods differ fundamentally in spatial decorrelation. For each mocap
sequence, the MDTT method segments the motion sequence into short clips and computes a set of orthogonal
basis functions tailored for all clips together, resulting in better decorrelation at the price of a large latency
and overhead for storing the data-dependent basis functions. In our clip-based method, the LSDT
bases are learned once and shared by all mocap data; therefore, there is no need to store the bases for each sequence.

The MDTT method adopts low-rank approximation, which is a linear approximation, to reduce the
dimension of transformed coefficients. In contrast, the LSDT makes the transform coefficients sparse by
quantization, which is a nonlinear approximation and more flexible. It has been pointed out in [29], [30] that
the nonlinear approximation outperforms the linear approximation in data compression.

From the CR-D curves in Figure 7, we observe that the MDTT [?] has better performance than our scheme
for long motion sequences (e.g., 15_04 and 56_07), where the overhead of storing the MDTT bases (compared
with the transformed coefficients) is negligible. However, for short sequences (e.g.,
17_10 and 49_02), the space usage for storing the basis functions in the MDTT is comparable to that of
the transformed coefficients, leading to a large overhead. As a result, its compression performance is not as
good as ours. For the remaining sequences, the MDTT method is comparable to ours.

Table 2: Qualitative comparison of various mocap compression methods. The latency is measured in number of frames.
#P: the number of parameters used in the encoding process; F_s: the number of frames in a mocap sequence; F_c: the number of
frames in a short clip, and F_s ≫ F_c. Note that the quantization parameter is not included in #P for all methods.

Category                       Method                     Latency   #P   Computational cost   Implementation   Compression performance
PCA-based                      Vasa and Brunnett [7]      F_s       5    high                 fair             high
                               Lin et al. [?]             F_s       3    fair                 difficult        medium
                               Arikan [11]                F_c       3    fair                 fair             low
                               Liu et al. [12]            F_c       3    fair                 fair             low
                               Tournier et al. [?]        F_s       2    high                 fair             medium
DCT/DWT-based                  Kwak and Bajic [10]        0         0    low                  easy             low
                               Chew et al. [13]           F_c       2    fair                 easy             medium
                               Firouzmanesh et al. [22]   F_c       3    low                  easy             low
Mocap Data Favored Transform   Zhu et al. [23]            F_s       3    high                 difficult        medium
                               Hou et al. [?]             F_s       2    low                  easy             high
                               Hou et al. [?]             F_s       2    fair                 fair             medium
                               Our frame-based method     0         0    low                  easy             high

Our clip-based method and the MDTT method have similar runtime performance: both can process
more than 10,000 frames per second on an Intel Core i7-3770 CPU (3.40 GHz).

In summary, both methods have merits. The mocap tailored transform is suitable for long motion
sequences in applications where large latency is tolerated, while our methods work for both short and
long sequences and are well suited to time-critical applications such as streaming.

5.4-2. Comparison with the PCA-RDO Method 

The PCA-RDO method [7] is a PCA-based approach that applies PCA twice. In the first round, 
it applies PCA to the entire motion sequence to obtain a reduced orthogonal basis of the pose space. This 
PCA, called pose space PCA, exploits the spatial correlation. Then, applying PCA to clips, it obtains 
an orthogonal basis for joint trajectories. This second PCA, called temporal PCA, performs temporal decorrelation. 
With two rounds of PCA, the data dimension is reduced significantly. Váša and Brunnett [7] also proposed 
a general preprocessing step based on Lagrange multipliers, which allows the user to optimize with respect 
to various error metrics. 
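
The two rounds of PCA described above can be written down roughly as follows. This is only a sketch in assumed notation (X, U_k, V_m, C, c_i, d_i, J, k and m do not appear in the text above), and the exact data arrangement used by the PCA-RDO authors may differ.

```latex
% Assumed notation, for illustration only:
%   X       : 3J x F_s matrix of joint coordinates (J joints, F_s frames)
%   \bar{X} : X with the mean pose subtracted from every column
%   k, m    : numbers of retained principal components (k <= 3J, m <= L)
\begin{align*}
\text{pose space PCA:}\quad
  & \bar{X} \approx U_k C, \qquad C = U_k^{\top}\bar{X} \in \mathbb{R}^{k \times F_s},
    \qquad U_k^{\top} U_k = I_k, \\
\text{temporal PCA:}\quad
  & c_i \approx V_m d_i, \qquad d_i = V_m^{\top} c_i \in \mathbb{R}^{m},
    \qquad V_m^{\top} V_m = I_m,
\end{align*}
% where each (mean-centred) c_i in R^L is one length-L clip cut from a row of C.
```

Under these assumptions, each frame is first reduced from 3J values to k coefficients, and each length-L coefficient trajectory is then reduced to m values. Both bases U_k and V_m must be stored alongside the coefficients.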

Our clip-based method and the PCA-RDO method differ in several aspects. First, the PCA-RDO method 
is sequence-based and therefore has large latency, whereas ours is clip-based and has low latency. Second, it is 
known that compressing the PCA orthogonal basis is difficult, even though their method adopts an advanced 
predictive coding scheme [31]. As Figures 7(a)(b)(c)(g) show, our clip-based scheme consistently outperforms the 
PCA-RDO method [7] in terms of compression performance. Third, like the MDTT method, the 
PCA-RDO method is based on low-rank approximation, so it is not as flexible as ours. Fourth, the 
PCA-RDO algorithm has a high computational cost: we observe that our clip-based method 
is 3 to 4 times faster. Last but not least, tuning the parameters of the PCA-RDO method is 
tedious and non-intuitive. In contrast, with our clip-based method the user only needs to specify the clip 
length L, which directly controls the latency. 
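
To illustrate how the clip length translates into latency, the following minimal sketch converts the number of frames an encoder must buffer into seconds. The function name, the 120 Hz capture rate, the clip length of 16 frames and the sequence length of 2400 frames are hypothetical values chosen for illustration. The model assumes that a frame-based coder buffers no frames, a clip-based coder buffers L frames, and a sequence-based coder buffers the whole sequence.

```python
def buffering_latency_seconds(frames_buffered: int, frame_rate_hz: float) -> float:
    """Return the time spent waiting for `frames_buffered` frames to arrive."""
    if frames_buffered < 0:
        raise ValueError("frames_buffered must be non-negative")
    if frame_rate_hz <= 0:
        raise ValueError("frame_rate_hz must be positive")
    return frames_buffered / frame_rate_hz


# All numbers below are assumptions for illustration, not values from the paper.
FRAME_RATE_HZ = 120.0      # hypothetical capture rate
CLIP_LENGTH = 16           # hypothetical clip length L
SEQUENCE_LENGTH = 2400     # hypothetical sequence length Fs

latency_by_scheme = {
    "frame-based": buffering_latency_seconds(0, FRAME_RATE_HZ),
    "clip-based": buffering_latency_seconds(CLIP_LENGTH, FRAME_RATE_HZ),
    "sequence-based": buffering_latency_seconds(SEQUENCE_LENGTH, FRAME_RATE_HZ),
}

for scheme, latency in latency_by_scheme.items():
    print(f"{scheme:15s} {latency:8.3f} s")
```

In this simplified model the latency grows linearly with the number of buffered frames, which is why a single parameter L is enough to control it.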

Finally, Figures 8 and 9 show some visual results of our methods, the DCT-based method, and the MDTT method, 
which further demonstrate the advantage of our methods. 

Figure 7: Compression performance of our clip-based schemes and the state-of-the-art methods, such as the PCA-RDO method 
[7] and the MDTT method. For MDTT, we adopt equal segmentation with L = 240 and follow the original MDTT paper to set the 
other parameters. The results of PCA-RDO were taken from [7]. 

5.5. Discussion 

We formulate the LSDT problem as a least-squares problem with an orthogonality constraint. In fact, a non-orthogonal 
matrix may produce even better compression performance. However, one then has to employ other constraints 
(e.g., on the determinant of B and the Frobenius norm of B) to ensure that the learned matrix is invertible (i.e., 
that the inverse transform exists) and has a small condition number. Correspondingly, the optimization 
problem becomes complicated and difficult to solve. 
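
A toy example may help show why invertibility alone is not enough and a small condition number matters. The sketch below is not the LSDT itself: it uses two hypothetical 2x2 transforms, a rotation (orthogonal, condition number 1) and a nearly singular matrix, quantizes the transformed coefficients with a uniform step, and compares the reconstruction error with the quantization error. All names and values are assumptions made for illustration.

```python
import math


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def inverse_2x2(m):
    """Invert a 2x2 matrix, failing loudly if it is singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular; no inverse transform exists")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def quantize(v, step):
    """Uniform scalar quantization of each coefficient."""
    return [step * round(c / step) for c in v]


def error_gain(transform, x, step):
    """Ratio of reconstruction error to coefficient quantization error."""
    y = matvec(transform, x)                        # forward transform
    q = quantize(y, step)                           # lossy step
    x_hat = matvec(inverse_2x2(transform), q)       # inverse transform
    coeff_err = math.hypot(q[0] - y[0], q[1] - y[1])
    recon_err = math.hypot(x_hat[0] - x[0], x_hat[1] - x[1])
    return recon_err / coeff_err


# Hypothetical test data, chosen only for illustration.
ANGLE = math.pi / 6
ROTATION = [[math.cos(ANGLE), -math.sin(ANGLE)],
            [math.sin(ANGLE), math.cos(ANGLE)]]     # orthogonal
SKEWED = [[1.0, 0.99],
          [0.99, 1.0]]                              # invertible but ill-conditioned
X = [1.23, -0.47]
STEP = 0.1

print("orthogonal gain:     ", error_gain(ROTATION, X, STEP))
print("ill-conditioned gain:", error_gain(SKEWED, X, STEP))
```

For the orthogonal transform the reconstruction error equals the quantization error, whereas for the ill-conditioned matrix the error is amplified by up to the reciprocal of its smallest singular value (here 1/0.01 = 100). This is one reason a non-orthogonal formulation needs extra constraints.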


6. Conclusion 


We presented frame- and clip-based methods for compressing mocap data with low latency. Taking 
advantage of the unique spatial characteristics of mocap data, we proposed a learned spatial decorrelation 
transform (LSDT) to effectively reduce its spatial redundancy. Owing to its data-adaptive nature, LSDT outperforms 
the commonly used data-independent transforms, such as the discrete cosine transform and the discrete wavelet 
transform, in terms of decorrelation performance. Experimental results show that the proposed methods 
achieve higher compression ratios at a lower computational cost and latency than the state-of-the-art 
methods. 

In our current implementation, we compress 3D position-based mocap data defined on a skeleton graph. 
However, it is straightforward to apply our methods to other types of mocap data, such as facial expressions, 
hand gestures, and motion of human bodies. In the future, we will extend our methods to compress mocap 
data represented by Euler angles. Due to the nonlinear nature of angles, the hierarchical structure may 
produce significant accumulation errors in the compressed data [11, 13]. We will seek effective data-driven 
techniques to tackle this challenge. 

Figure 8: Visual comparison of frame-based schemes. The distortions are colored as a heat map, and the frames are 
uniformly extracted from the sequences. Left: the DCT-based method; Right: our frame-based scheme. 

Figure 9: Visual comparison of our clip-based scheme and the MDTT method. The joint distortions are colored as a 
heat map, and the frames are uniformly extracted from the test sequences. Left: MDTT; Right: our clip-based scheme. 